Let's assume a linear relationship between the number of Uber employees in an elevator and the time it takes you to reach the 18th floor.
How do we find x?
One way of solving this problem is to find x directly, where $x=6$ (i.e., for every Uber employee in the elevator, you should expect to wait an extra 6 seconds).
Let's multiply x by 20: 0.03*20=0.6
error = 30 - 5 * 0.6
error = 27
This is still not great. We doubled the guessed value, but our error decreased only by about 5% (1-27/28.5). If we really want to make some progress, we need to increase the value by a lot more. We can use the change in the error to figure out by how much x should increase.
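As a rough sketch of that idea, the snippet below uses the change in error between two guesses to estimate how far x still has to move. It assumes 5 employees and a 30 second target (the numbers in the error calculation above) and a previous guess of 0.3, which is the value implied by the earlier error of 28.5. The variable names are made up for illustration.

```python
# Assumed setup: 5 employees in the elevator, 30 seconds observed.
employees = 5
target_seconds = 30


def error(x):
    """Difference between the observed time and the time predicted by guess x."""
    return target_seconds - employees * x


# Two consecutive guesses; 0.3 is inferred from the earlier error of 28.5.
x_prev, x_curr = 0.3, 0.6

# How much the error changes per unit change in x.
slope = (error(x_curr) - error(x_prev)) / (x_curr - x_prev)

# Move x by the amount that would bring the error to zero at that rate.
x_next = x_curr - error(x_curr) / slope

print("next guess:", round(x_next, 4))
print("remaining error:", round(error(x_next), 4))
```

Because the relationship here is exactly linear, a single step of this kind lands on the answer; for a real network the error surface is not linear, which is why the talk moves on to gradients.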
There are a lot more activation functions out there. See this answer on CrossValidated.
Input to the first neuron in layer 2 is: $$x=(1*0.4) + (-0.6*0.6)$$ $$x=0.04$$
Output is: $$y(x)=\frac{1}{1+e^{-0.04}}$$ $$y(x) \approx 0.5100$$
Input to the second neuron in layer 2 is: $$x=(1*0.9) + (-0.6*0.3)$$ $$x=0.72$$
Output is: $$y(x)=\frac{1}{1+e^{-0.72}}$$ $$y(x)= 0.6726$$
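The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the names `weights`, `inputs`, `layer2_inputs`, and `layer2_outputs` are made up for illustration, and the numbers are the ones used in the worked example.

```python
import math


def sigmoid(x):
    """Logistic function used as the activation."""
    return 1.0 / (1.0 + math.exp(-x))


# One row of weights per neuron in layer 2.
weights = [[0.4, 0.6], [0.9, 0.3]]
inputs = [1.0, -0.6]

# Weighted sum of the inputs for each neuron in layer 2.
layer2_inputs = [sum(w * i for w, i in zip(row, inputs)) for row in weights]

# Apply the sigmoid to each weighted sum.
layer2_outputs = [sigmoid(x) for x in layer2_inputs]

print([round(x, 2) for x in layer2_inputs])
print([round(y, 4) for y in layer2_outputs])
```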
In [15]:
### matrix multiplication using numpy
import numpy as np
W = np.array([[0.4, 0.6], [0.9, 0.3]])
I = np.array([1, -0.6])
X=np.dot(W, I)
X
Out[15]:
$$ error_{hidden}= \left[ {\begin{array}{cc} \frac{w_{1,1}}{w_{1,1}+w_{2,1}} & \frac{w_{1,2}}{w_{1,2}+w_{2,2}} \\ \frac{w_{2,1}}{w_{2,1}+w_{1,1}} & \frac{w_{2,2}}{w_{2,2}+w_{1,2}} \\ \end{array} } \right] \cdot \left[ {\begin{array}{c} e_{1} \\ e_{2} \\ \end{array} } \right] $$
Although normalizing the weights is the mathematically correct thing to do, in practice you can actually ignore it. This makes the matrix operations much simpler, at very little cost to the time it takes to train a network.
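To make the two options concrete, here is a small sketch that propagates the output errors back to the hidden layer both ways. The weights and errors are illustrative values, not taken from the talk, and `w[j, k]` is assumed to hold the weight from hidden node j to output node k, matching the indices in the matrix above.

```python
import numpy as np

# Illustrative values (assumptions): w[j, k] links hidden node j to output node k.
w = np.array([[2.0, 1.0],
              [3.0, 4.0]])
e = np.array([0.8, 0.5])  # output errors e_1, e_2

# Normalized version: each column is divided by the sum of the weights
# feeding the same output node, as in the matrix above.
normalized = w / w.sum(axis=0, keepdims=True)
error_hidden_normalized = normalized @ e

# Simplified version: skip the normalization and use the weights directly.
error_hidden_simple = w @ e

print("normalized:", error_hidden_normalized)
print("simplified:", error_hidden_simple)
```

The simplified version gives larger numbers, but each hidden node still receives a share of the error that grows with its weights. If the weights are stored with one row per output node instead, the simplified product becomes the transposed weight matrix times the error vector.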
You've gotten pretty far, so you should take a breather.
What comes next is the hardest part of all, but we need to get through it.
Remember at the start of the talk, when we used the change in the error between two iterations to adjust the parameter for the number of Uber employees? Well, that's not a good way of doing it.
You can make your own customized cost functions that include weights or that punish the model for making strong inaccurate predictions (e.g., log loss). See more here. You can find more about their implementation by examining the Keras source code.
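As a hedged illustration of what "punishing strong inaccurate predictions" means, the sketch below compares the squared error with a binary log loss for a single prediction. The function names and the clipping constant `eps` are assumptions for this example and are not the Keras implementations.

```python
import numpy as np


def squared_error(target, output):
    """Squared difference between target and prediction."""
    return (target - output) ** 2


def log_loss(target, output, eps=1e-12):
    """Binary log loss; the prediction is clipped to avoid log(0)."""
    output = np.clip(output, eps, 1 - eps)
    return -(target * np.log(output) + (1 - target) * np.log(1 - output))


target = 1.0
for output in (0.9, 0.5, 0.01):
    print(output, squared_error(target, output), log_loss(target, output))
```

The squared error can never exceed 1 for a single output in [0, 1], while the log loss keeps growing as the prediction becomes more confidently wrong.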
We can rewrite $$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial o_{k}} * \frac{\partial o_{k}}{\partial w_{jk}}$$
Taking the derivative of the error, we get: $$\frac{\partial E}{\partial w_{jk}} = -2(t_{k}-o_{k}) * \frac{\partial o_{k}}{\partial w_{jk}}$$
$o_{k}$ also needs to be differentiated. Remember that $o_{k}$ is equal to: $$S(\sum_{1}^{j} w_{jk}o_{j})$$ where $S$ is the sigmoid.
Fortunately the derivative of the sigmoid is fairly easy to compute: $$\frac{dS(x)}{dx} = S(x)(1-S(x))$$
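If you want to convince yourself of the sigmoid derivative, a quick numerical check is enough. This sketch compares the closed form against a central finite difference; the helper names and the step size `h` are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Closed form: S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 0.04, 0.72):
    print(x, sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```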
We still have to take the derivative of the inner expression: $\sum_{1}^{j} w_{jk}o_{j}$
That comes to be $o_{j}$. Does anyone know why?
Our final expression comes to $$\frac{\partial E}{\partial w_{jk}} = -(t_{k}-o_{k}) * S(\sum_{1}^{j} w_{jk}o_{j}) * (1-S(\sum_{1}^{j} w_{jk}o_{j}))* o_{j}$$
Notice that we dropped the constant 2. In the grand scheme of things it doesn't really change much because, as we will see later, we use a learning rate parameter to adjust how large each weight update is.
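The derivation can be sanity-checked numerically. The sketch below builds a tiny output layer with random values (all of them assumptions for illustration), computes the gradient with the formula above, keeping the factor 2, and compares it with a finite-difference estimate of the summed squared error. Here `w[k, j]` stores the weight from hidden node j to output node k, with one row per output node.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def squared_error(w, o_j, t):
    """Summed squared error of the output layer for weights w."""
    o_k = sigmoid(w @ o_j)
    return np.sum((t - o_k) ** 2)


rng = np.random.default_rng(0)
o_j = rng.uniform(0.1, 0.9, size=3)    # hidden-layer outputs (illustrative)
w = rng.normal(0.0, 0.5, size=(2, 3))  # w[k, j]: hidden node j -> output node k
t = np.array([0.99, 0.01])             # targets (illustrative)

# Analytic gradient: -2 (t_k - o_k) * o_k (1 - o_k) * o_j
o_k = sigmoid(w @ o_j)
analytic = np.outer(-2 * (t - o_k) * o_k * (1 - o_k), o_j)

# Finite-difference gradient, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for k in range(w.shape[0]):
    for j in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[k, j] += eps
        w_minus[k, j] -= eps
        numeric[k, j] = (squared_error(w_plus, o_j, t)
                         - squared_error(w_minus, o_j, t)) / (2 * eps)

print("max difference:", np.max(np.abs(analytic - numeric)))
```

Dropping the 2 only rescales this gradient by a constant, which the learning rate absorbs.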
In [16]:
import numpy as np
import scipy.special
# create the neural network
class network:
    def __init__(self, inputnodes, hiddennodes, outputnodes, learning_rate):
        # set the number of nodes for each input, hidden, and output layer
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes
        # create weights linking the nodes from one layer to another
        self.wih = np.random.normal(0.0, pow(self.hnodes, -0.5), (self.hnodes, self.inodes))
        self.who = np.random.normal(0.0, pow(self.onodes, -0.5), (self.onodes, self.hnodes))
        # set the learning rate
        self.lr = learning_rate
        # set the activation function (the sigmoid)
        self.activation_function = lambda x: scipy.special.expit(x)

    def train(self, inputs_list, target_list):
        # convert the inputs and targets to column vectors
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(target_list, ndmin=2).T
        # calculate signals into the hidden layer
        hidden_inputs = np.dot(self.wih, inputs)
        # calculate the signals emerging from the hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)
        # calculate signals into the final output layer
        final_inputs = np.dot(self.who, hidden_outputs)
        # calculate the signals emerging from the final output layer
        final_outputs = self.activation_function(final_inputs)
        # get the output error
        output_errors = targets - final_outputs
        # hidden layer errors: output errors split by the (unnormalized) weights
        hidden_errors = np.dot(self.who.T, output_errors)
        # update the weights using the backpropagation formula for the middle layer to final layer
        self.who += self.lr * np.dot((output_errors * final_outputs * (1 - final_outputs)),
                                     np.transpose(hidden_outputs))
        # update the weights for the wih layer
        self.wih += self.lr * np.dot((hidden_errors * hidden_outputs * (1 - hidden_outputs)),
                                     np.transpose(inputs))

    def predict(self, input_list):
        # convert the inputs to column vectors
        inputs = np.array(input_list, ndmin=2).T
        # calculate signals into the hidden layer
        hidden_inputs = np.dot(self.wih, inputs)
        # calculate the signals emerging from the hidden layer
        hidden_outputs = self.activation_function(hidden_inputs)
        # calculate signals into the final output layer
        final_inputs = np.dot(self.who, hidden_outputs)
        # calculate the signals emerging from the final output layer
        final_outputs = self.activation_function(final_inputs)
        return final_outputs
In [17]:
# read the train and test data
# you can get the data from here: https://www.kaggle.com/c/digit-recognizer/data
import pandas as pd
train = pd.read_csv('data/mnist_train.csv', header=None)
test = pd.read_csv('data/mnist_test.csv', header=None)
x_train = train.loc[:, 1:]
y_train = train.loc[:, 0]
x_test = test.loc[:, 1:]
y_test = test.loc[:, 0]
In [18]:
pd.set_option('display.max_columns', 30)
pd.DataFrame(np.array(x_train.iloc[1]).reshape(28, 28))
Out[18]:
In [19]:
# plot an example
import matplotlib.pyplot as plt
%matplotlib inline
index = 1
# np.asfarray was removed in NumPy 2.0; np.asarray with dtype=float is equivalent
image_array = np.asarray(x_train.iloc[index].values.reshape((28, 28)), dtype=float)
plt.imshow(image_array, cmap='Greys', interpolation='none')
print("actual outcome value: {}".format(y_train[index]))
In [20]:
# scale the input
x_train_scaled = x_train/255 * 0.99 + 0.01
x_test_scaled = x_test/255 * 0.99 + 0.01
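To see what this scaling does, here is a minimal sketch on three hypothetical pixel values. It maps the raw range 0 to 255 onto 0.01 to 1.0, so no input is exactly zero; a zero input would contribute nothing to the weight update, since the update is proportional to the input.

```python
import numpy as np

# Hypothetical raw pixel intensities: darkest, mid-grey, brightest.
pixels = np.array([0, 128, 255])

# Same transformation as in the cell above.
scaled = pixels / 255 * 0.99 + 0.01

print(scaled)
```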
In [21]:
### Setting up the network
input_nodes = 784
hidden_nodes = 100
output_nodes = 10
# learning rate
learning_rate = 0.2
# create the instance of the model
net = network(input_nodes, hidden_nodes, output_nodes, learning_rate)
train_targets = np.zeros((len(y_train), output_nodes)) + 0.01
train_targets[np.array(range(0, len(y_train))), y_train.values] = 0.99
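The target matrix built above has one row per training example and one column per digit. A small sketch with hypothetical labels shows the layout; the values 0.01 and 0.99 are used instead of 0 and 1 because the sigmoid never outputs exactly 0 or 1.

```python
import numpy as np

# Hypothetical labels and class count, for illustration only.
labels = np.array([3, 0, 1])
num_classes = 4

# Start every target at 0.01, then mark the correct class with 0.99.
targets = np.full((len(labels), num_classes), 0.01)
targets[np.arange(len(labels)), labels] = 0.99

print(targets)
```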
We are implementing a type of gradient descent called stochastic gradient descent. This basically means that we are updating the weights after each training example. There are other types of gradient descent, and I highly recommend reading this paper by Ruder (2016) for an overview on different types of gradient descent.
Stochastic gradient descent has a few benefits.
Obviously, it is not the best numerical optimization algorithm out there. I recommend reading about Adam (Kingma and Ba, 2015), which implements momentum and "memory"-like strategies that allow for very rapid convergence with a smaller likelihood of getting stuck in sub-optimal solutions.
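To show what "updating the weights after each training example" looks like in isolation, here is a minimal sketch of stochastic gradient descent on the one-parameter elevator model. The data are synthetic and noise-free, and the learning rate and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): time = 6 seconds per employee, no noise.
employees = rng.uniform(1, 10, size=200)
seconds = 6.0 * employees

w = 0.0    # initial guess for seconds per employee
lr = 0.01  # learning rate

# One weight update per training example.
for x_i, y_i in zip(employees, seconds):
    error = y_i - w * x_i
    # Step against the gradient of the squared error (constant factor dropped).
    w += lr * error * x_i

print("estimated seconds per employee:", round(w, 4))
```

The update has the same shape as the one in the `train` method: learning rate times error times input, applied once for every example.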
In [22]:
for features, outcome in zip(x_train_scaled.values, train_targets):
    net.train(inputs_list=features, target_list=outcome)
In [23]:
predictions = np.argmax(net.predict(x_test_scaled).T, axis=1)
accuracy = np.mean(predictions == y_test.values)
print("observed accuracy {}".format(accuracy*100))
In [10]:
### train for more epochs
epochs = 4
for epoch in range(epochs):
    print("Starting epoch {}".format(epoch+1))
    for features, outcome in zip(x_train_scaled.values, train_targets):
        net.train(inputs_list=features, target_list=outcome)
In [11]:
predictions = np.argmax(net.predict(x_test_scaled).T, axis=1)
accuracy = np.mean(predictions == y_test.values)
print("observed accuracy {}".format(accuracy*100))
In [25]:
# Doing the same thing with an open source library: [Keras](https://keras.io/)
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
batch_size = 1
num_classes = 10
epochs = 2
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Dense(100, activation='sigmoid', input_shape=(784,)))
model.add(Dense(10, activation='sigmoid'))
model.summary()
model.compile(loss='mean_squared_error',
              optimizer=SGD(),
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])